Assignment 4

The goal of this assignment is to replicate four plots about different topics as closely as possible using ggplot2. Do not worry if you cannot replicate the plots 100%, the main purpose of this assignment is to learn more about the ggplot2 package and visualisation. Trying to replicate plots is an excellent way to do this.

suppressMessages(library(ggthemes))
suppressMessages(library(tidyverse))
suppressMessages(library(scales))
suppressMessages(library(ggplot2))
suppressMessages(library(ggrepel))
suppressMessages(library(socviz))
suppressMessages(library(Hmisc))

1. Chaos

In this exercise, you will create one of the most famous plots in chaos theory. The equation of the logistic map is very simple, but its behaviour is stunningly complex:

\[ x_{n+1} = rx_{n}(1-x_{n}) \]

Starting with an initial value of \(x_{0}\) between one and zero, e.g. 0.5, and setting a constant value of r e.g. between zero and four, this equation can be iterated forward. If you are interested in the topic, there are plenty of videos discussing it, for example here or here.

The goal is to create a plot with different values of r on the x-axis and then x values on the y-axis corresponding to each r value. In parts of the plot, all these x values will be on a single point, but for other r values x moves perpetually.

The following code cell computes the dataset for you. You are welcome to study the code, but this is not part of the assignment and you do not have to worry about how exactly it works (this is not a course about chaos theory after all). The only part to do is to plot the data such it resembles the figure below. Data is already in a tidy format, one variable denotes the value of r, one variable the value of the associated x’s. For each value of r there are n=1000 observations/rows of x values (these can be constant or fluctuating, depending on the value of r).

# x observations for each r value
n <- 1000
# Step between each r value
r_step <- 0.001

r_range <- seq(2.5, 4, by = r_step)
to_discard <- 500 # numbers of observations discarded before the n which are stored
data <- matrix(0, nrow = n*length(r_range), 2)
for (r in r_range) {
  
  logistic_map_series <- numeric(n+to_discard)
  logistic_map_series[1] <- 0.5
  
  for (k in 1:(n+to_discard-1)) {
    
    logistic_map_series[k+1] <- r*logistic_map_series[k]*(1-logistic_map_series[k])
    
  }
  
  start_index <- 1+n*(match(r, r_range) - 1)
  end_index <- n*match(r, r_range)
  
  data[start_index:end_index,1] <- r
  data[start_index:end_index,2] <- tail(logistic_map_series,n)

}

data <- as_tibble(data.frame(data))
colnames(data) <- c("r", "x")

Hint: Create your final dataset with n <- 1000 and r_step <- 0.001, however, for these values it takes R some time to compute the plot. When building your plot, adjusting axes, colours, etc., one approach is to first use e.g. n <- 10 and r_step <- 0.01 until you have a version of the plot that you are happy with. Just note that the opacity parameter will have to be decreased again once you have increased n because now there are more points in the plot.

Hint 2: Once you have computed a final version, set the code cell opening tag to “{r, cache = TRUE}” which should store the result from your final plot and avoid the file running again for long times when e.g. knitting.

# Make the plot
plot <- ggplot(data, aes(x = r, y = x))

# Make it look nice
chaos_plot <- plot + 
  geom_point(color = "dark green", size = 0.00001, alpha = 0.01) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line.x = element_line(colour = "black", size = 0.3),
        axis.ticks.x = element_line(colour = "black", size = 0.3),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.y = element_blank())

ggsave("chaos-jonathan.png", chaos_plot, path = "plots", width = 8, height = 5)

2. American elections and the electoral college

The winner of the American presidential elections is determined through a system called the electoral college. Each of the 50 states (and Washington DC) is allocated a number of electoral college votes (adding to 538). States with higher population get more college votes than small states. The candidate that gets the most votes in a state, wins all of the state’s electoral college votes (except Maine and Nebraska). A candidate wins the election if they get at least 270 electoral college votes (50% + 1). See more here, but don’t forget—this exam is about visualisation.

An interesting consequence of these rules is that a candidate can win the election without getting the most votes at the national level. This happened in 2016, when Donald Trump won a majority of electoral college votes although more people voted for Hillary Clinton.

In this plot, we will analyse results of American presidential elections since the 1860’s. We will first plot the results for all winning presidential candidates, showing the share of the popular vote on the x-axis and the share of the electoral college votes on the y-axis. We will then highlight the results of the last two elections.

Note: The data for 2020 are provisional as the counting is stil not concluded in some states (e.g. Georgia) or contested at the Supreme Court.

Hint: For placing names of presidents around the dots, see ggrepel package.

eldata <- elections_historic %>%
  select(2:6) %>%
  filter(year > 1859) %>%
  add_row(year = 2020,
          winner = "Joe Biden",
          win_party = "Dem.",
          ec_pct = 0.569,
          popular_pct = 0.519)

# Random seed to recreate the exact same layout every time
set.seed(41)

# Make the plot
election_plot <- ggplot(eldata, aes(x = popular_pct, y = ec_pct)) +
  geom_point(data = eldata[eldata$year < 2015,], aes(colour = win_party), alpha = 0.2, size = 2.25) +
  geom_text_repel(data = subset(eldata, year < 2015), aes(label = winner, color = win_party, alpha = 0.2), size = 2.25, box.padding = 0.2, point.padding = 0.2, force = 1, direction = "both") +
  geom_point(data = eldata[eldata$year > 2015,], aes(colour = win_party), size = 5) +
  geom_text_repel(data = subset(eldata, year > 2015), aes(label = paste(str_extract_all(winner, "\\s.+"), year), color = win_party), size = 5, box.padding = 1.5, force = 1, direction = "both") +
  scale_colour_manual(values = c("Rep." = "#e91c24", "Dem." = "#1422be")) +
  geom_hline(yintercept=0.5, color = "grey", size = 1.75, alpha = 0.75) + 
  geom_vline(xintercept=0.5, color = "grey", size = 1.75, alpha = 0.75) +
  labs(title = "Presidential Elections: Popular  & Electoral College Margins", subtitle = "1860-2020", caption = "Data for 2020 are provisional.") + xlab("Winner's Share of Popular Vote") + ylab("Winner's Share of Electoral College Votes") +
  expand_limits(y = 0.4) +
  scale_x_continuous(breaks = c(0.4, 0.5, 0.6), labels = c("40%", "50%", "60%")) +
  scale_y_continuous(breaks = c(0.4, 0.6, 0.8, 1.0), labels = c("40%", "60%", "80%", "100%")) +
  theme(panel.border = element_rect(colour = "black", fill=NA),
        panel.background = element_blank(),
        panel.grid = element_line(colour = "grey", size = 0.1),
        plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 10),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 11),
        legend.position = "none",
        aspect.ratio = 1,)

ggsave("electoral-college-jonathan.png", election_plot, path = "plots", width = 8, height = 8)

#increase size and font

3. Popularity metrics by party and gender

For this third exercise, you will have to replicate a graph that displays the average popularity metrics of legislators grouped by gender and party. Note that this example involves some reshaping of the data which you can do with dyplr from the tidyverse.

# Data for the plot
df <- read.csv("data/fb-congress-data.csv", stringsAsFactors=FALSE)

#Exclude independent candidate & key gender and party
df_rep_dem <- df %>%
  filter(party != "Independent") %>%
  mutate(gender_party_key = ifelse(party == "Democrat" & gender == "F", 1,
                            ifelse(party == "Democrat" & gender == "M", 2,
                            ifelse(party == "Republican" & gender == "F", 3,
                            ifelse(party == "Republican" & gender == "M", 4, 9999)))))


df_tidy <- df_rep_dem %>%
  pivot_longer(cols = colnames(df_rep_dem)[5:12],names_to = "engagement_type", values_to = "engagement_count")
  
df_tidy$engagement_type <- factor(df_tidy$engagement_type, levels = c("likes_count", "comments_count",  "shares_count", "love_count", "haha_count", "wow_count", "angry_count", "sad_count"))
colours_mg_ml <- c("1" = "#00178b", "2" = "#0332ff", "3" = "#8b0f00", "4" = "#ff2501")

fb_metrics_plot <- ggplot(df_tidy, aes(x = gender_party_key, y = engagement_count)) +
  stat_summary(fun.data=mean_sdl, geom="bar") +
  scale_fill_manual(values = colours_mg_ml) +
  scale_x_continuous(breaks = c(1, 2, 3, 4), labels = c("D-F", "D-M", "R-F", "R-M")) +
  facet_wrap(~engagement_type, nrow=2, scales = "free_y") + 
  aes(fill = as.factor(gender_party_key)) +
  labs(title="Partisan asymmetries by gender in Facebook popularity metrics", subtitle = "Female Democrats receive more engagement than Male Democrats. The opposite is true for Republicans.", x="Party and gender of Member of Congress", y="Average of each type of social metric") +
  theme_minimal() +
  theme(legend.position = "none",
        panel.grid.major = element_line(size = 0.35),
        panel.grid.minor.x = element_blank(),
        axis.title = element_text(size = 8.5),
        axis.text = element_text(size = 7),
        plot.title = element_text(size = 10.5),
        plot.subtitle = element_text(size = 9),
        strip.text = element_text(size = 7),
        aspect.ratio = 0.8)
  
ggsave("party-gender-FB-metrics-jonathan.png", fb_metrics_plot, path = "plots", width = 8, height = 4)
## Warning: Removed 590 rows containing non-finite values (stat_summary).

4. Ideology of presidential candidates in the US

For the last exercise, please replicate the plot below, which Pablo prepared for a Washington Post blog post a few years ago.

The plot combines two sources of data: The ideology estimates for each actor (available in ideology.csv) and a random sample of ideology estimates for the three density plots (in ideology2.csv).

As a clue, Pablo used theme_tufte from the ggthemes package as main theme (which he then edited manually). But there may be other ways of replicating it.

# Data for main plot
ideology <- read.csv("data/ideology.csv")

# Data for background plots
bg <- read.csv("data/ideology2.csv")

names(bg)[1] <- "twscore"

#Disable scientific notation
options(scipen = 999)

# Compute the mean for democrats and republicans
bg %>%
  group_by(type) %>%
  summarise(mean = mean(twscore))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   type            mean
##   <chr>          <dbl>
## 1 Democrat   -1.01e+ 0
## 2 Republican  7.87e- 1
## 3 Z          -7.29e-17
#Make the plot
ideology_plot <- ggplot() +
  geom_density(data = bg, aes(x = twscore, fill = type, color = "white"), alpha = 0.3) +
  geom_pointrange(data = ideology, aes(x = twscore, y = as.numeric(row.names(ideology))/20, xmin = twscore-twscore.sd*2, xmax = twscore+twscore.sd*2, colour = party, fill = factor(party)), size=0.2) +
  geom_text(data = ideology, aes(x = twscore, y = as.numeric(row.names(ideology))/20, label = screen_name, color = party), size = 2.5, x = ifelse(ideology$twscore > 0, ideology$twscore + ideology$twscore.sd, ideology$twscore - ideology$twscore.sd), hjust = ifelse(ideology$twscore > 0, -0.25,  1.35), family = "serif") +
  scale_fill_manual("type", limits = c("Democrat", "Republican", "Z"), values = c("blue", "red", "black")) +
  scale_color_manual("party", limits = c("Democrat", "Republican", "Z"), values = c("blue", "red", "black")) +
  scale_x_continuous(breaks = c(-2,-1,0,1,2), limits = c(-2.5,2.5)) + 
  labs(title = "Twitter ideology scores of potential Democratic and Republican presidential primary candidates") +
  xlab("Position on latent ideological scale") +
  geom_text(aes(x=0, label="Average Twitter User", y=0.35), angle=90, vjust = -0.5, size=2.5) + 
  geom_text(aes(x=0.78739357121547215445645, label="Average Republican \n in 114th Congress", y=1.4), angle=90, vjust = -0.2, size=2.5) + 
  geom_text(aes(x=-1.01212724745149640348529, label="Average Democrat in \n 114th Congress", y=0.35), angle=90, vjust = -0.2, size=2.5) +
  geom_segment(aes(x = 0, y = 0, xend = 0, yend = 1.6), size = 0.3, alpha = 0.3) +
  geom_segment(aes(x = 0.78739357121547215445645, y = 0, xend = 0.78739357121547215445645, yend = 1.6), size = 0.3, colour = "red", alpha = 0.3) +
  geom_segment(aes(x = -1.01212724745149640348529, y = 0, xend = -1.01212724745149640348529, yend = 1.6), size = 0.3, colour = "blue", alpha = 0.3) +
  #theme_minimal() +
  theme_tufte() +
  theme(panel.grid = element_blank(),
        legend.position = "none",
        axis.title.y = element_blank(),
        axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(vjust = -5, hjust = 0.5, face = "bold", size = 10))

ggsave("ideology-plot-jonathan.png", ideology_plot, path = "plots", width = 8, height = 5)
## Warning: Removed 392 rows containing non-finite values (stat_density).